CAGEF_services_slide.png

Lecture 03: Tidying your data


0.1.0 About Introduction to R

Introduction to R is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.

The structure of this course is a code-along style; it is 100% hands on! A few hours prior to each lecture, links to the materials will be available for download on Quercus. The teaching materials will consist of a JupyterLab notebook with concepts, comments, instructions, and blank spaces that you will fill out with R by coding along with the instructor. Other teaching materials include an HTML version of the notebook, and datasets to import into R when required. This learning approach will allow you to spend your time coding and not taking notes!

As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post-lecture assessments will also be available (see the syllabus for the grading scheme and percentages of the final mark) through DataCamp to help cement and/or extend what you learn each week.

0.1.1 Where is this course headed?

We'll take a blank slate approach here to R and assume that you pretty much know nothing about programming. Over the span of this course, we want to take you from some potential starting scenarios:

and get you to a point where you can:

data-science-explore.png

0.1.2 How do we get there? Step-by-step.

In the first two lessons, we will talk about the basic data structures and objects in R, get cozy with the RStudio environment, and learn how to get help when you are stuck. Because everyone gets stuck - a lot! Then you will learn how to get your data in and out of R, how to tidy your data (data wrangling), subset and merge data, and generate descriptive statistics. Next will be data cleaning and string manipulation; this is really the battleground of coding - getting your data into a format where you can analyse it. After that, we will make all sorts of plots for both data exploration and publication. Lastly, we will learn to write customized functions and apply more advanced statistical tests, which really can save you time and help scale up your analyses.

Draw_an_Owl-2.jpg

The structure of the class is a code-along style: it is fully hands on. At the end of each lecture, the complete notes will be made available in PDF format through the corresponding Quercus module, so you don't have to spend your attention on taking notes.


0.2.0 Class Objectives

This is the third in a series of seven lectures. Last lecture we focused on manipulating data frames with the help of the dplyr package. This week we'll dig deeper into the tidyverse and how we can clean up our data. At the end of this session you will be familiar with the principles of tidy data; subsetting and transforming your data to perform simple calculations; and converting your data between wide and long formats. Our topics are broken into:

  1. Learn tidy data principles and how to convert from wide-format to long-format data.
  2. Learn how to cut and query our dataset for different information.
  3. Learn how to revert data back from long-format to wide-format.

0.3.0 A legend for text format in Jupyter markdown

Blue box: A key concept that is being introduced
Yellow box: Risk or caution
Green box: Recommended reads and resources to learn R

0.4.0 Lecture and data files used in this course

0.4.1 Weekly Lecture and skeleton files

Each week, new lesson files will appear within your JupyterHub folders. We are pulling from a GitHub repository using this Repository git-pull link. Simply click on the link and it will take you to the University of Toronto JupyterHub. You will need to use your UTORid credentials to complete the login process. From there you will find each week's lecture files in the directory /2021-09-IntroR/Lecture_XX. You will find a partially coded skeleton.ipynb file as well as all of the data files necessary to run the week's lecture.

Alternatively, you can download the Jupyter Notebook (.ipynb) and data files from JupyterHub to your personal computer if you would like to run independently of the JupyterHub.

0.4.2 Live-coding HTML page

A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!

0.4.3 Post-lecture PDFs and Recordings

As mentioned above, at the end of each lecture there will be a completed version of the lecture code released as a PDF file under the Modules section of Quercus. A recorded version of the lecture will be made available through the University's MyMedia website and a link will be posted in the Discussion section of Quercus.


0.4.4 Dataset 1: Latrines

Sequencing of the V3-V5 hypervariable regions of the 16S rRNA gene

16S rRNA gene amplicon sequencing of 30 latrines from Tanzania and Vietnam at different depths (multiples of 20 cm). Microbial abundance is represented in Operational Taxonomic Units (OTUs): groups of organisms defined by a specified level of DNA sequence similarity at a marker gene (e.g. 97% similarity at the V4 hypervariable region of the 16S rRNA gene). Intrinsic environmental factors such as pH, temperature, and organic matter composition were also recorded.

0.4.4.1: data/taxa_pitlatrine_wide.csv

An OTU-taxa table with sample names (naming convention: [Country_LatrineNo_Depth]) and environmental variables.

0.4.4.2: Data source

B Torondel, JHJ Ensink, O Gundogdu, UZ Ijaz, J Parkhill, F Abdelahi, V-A Nguyen, S Sudgen, W Gibson, AW Walker, and C Quince. Assessment of the influence of intrinsic environmental and geographical factors on the bacterial ecology of pit latrines. Microbial Biotechnology, 9(2):209-223, 2016. DOI:10.1111/1751-7915.12334

latrines_wide_to_long.png


0.4.5 Dataset 2: Microbes

This dataset is the result of 16S rRNA gene amplicon sequencing of samples from microbial communities cultured in fresh, brackish, or saline media. Treatments received the aromatic compounds toluene or pyrene as the sole source of carbon and energy. Controls did not receive any compounds (substrate-free) to account for any alternative carbon sources present in the media. The objective of this experiment was to evaluate which microorganisms would make use of toluene and pyrene.

0.4.5.1: data/microbes_wide.csv

Taxa table with gene sequences (ASV) and sample information. This file must be converted from wide to long format before use.


0.4.6 Dataset 3: Gapminder data

This data comes from the gapminder package which has an excerpt of data from gapminder.org circa 2010. It is a dataset used for exploring data and formatting it.

0.4.6.1 data/gapminder_wide.csv

This file covers variables related to country statistics tracking life expectancy, population size, and GDP per capita over years spanning from 1952 to 2007. We'll explore this dataset a little at the end of lecture.


0.5.0 Packages Used in This Lesson

The following packages are used in this lesson:

0.5.1 Quick review for how to install R packages in Anaconda

Open up the Anaconda prompt and install your packages of interest.

  1. conda install - the starting command that calls the installer for Anaconda.
  2. -c conda-forge - look for the package in the 'conda-forge' channel.
  3. r-packagename - the name of the package you're interested in installing.

Combine the parts into a single command like:

conda install -c conda-forge r-gapminder

Note that 8 different packages are loaded with the tidyverse, and that 2 functions from the stats package are masked (superseded) by functions of the same name from dplyr. You can still access the stats version of a function by calling it directly, as in stats::filter().
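As a quick sketch (assuming the packages were installed via conda as above), loading the tidyverse and calling a masked stats function directly looks like this:

```r
# Load the tidyverse wrapper (attaches ggplot2, dplyr, tidyr, readr,
# purrr, tibble, stringr, and forcats)
library(tidyverse)

# dplyr masks stats::filter() and stats::lag(); the originals are
# still reachable with the package prefix:
stats::filter(1:10, rep(1/3, 3))  # a moving average, not dplyr's filter
```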


0.6.0 Data basics: The wide and long formats

0.6.1 Definitions:

0.6.2 Wide versus long format

Wide and long (sometimes called un-stacked and stacked, wide and tall, or wide and narrow) are terms used to describe how a dataset is formatted.

In wide format, variables may be listed in the first column, each forming a row of its own, while observations are presented as columns holding the observed values for each variable.

In long format, each variable is its own column, and the results of each measured variable are recorded in rows. In data science, long format is preferred over wide format because it allows for easier and more efficient subsetting and manipulation of the data.

To read more about wide and long formats, visit https://eagereyes.org/basics/spreadsheet-thinking-vs-database-thinking.
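To make the distinction concrete, here is a tiny made-up example (the scores data frame is purely for illustration):

```r
library(tidyr)

# Wide format: one row per student, one column per test
scores_wide <- tibble::tibble(
  student = c("A", "B"),
  test1   = c(80, 75),
  test2   = c(90, 85)
)

# Long format: one row per student-test observation
scores_long <- pivot_longer(scores_wide,
                            cols      = c(test1, test2),
                            names_to  = "test",
                            values_to = "score")
scores_long  # 4 rows with columns: student, test, score
```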

wide_and_long_formats.png


0.7.0 Some data preparation

In this lesson we want to answer 3 simple questions:

Last lesson, we learned how to filter and select data subsets we were interested in. However, we can make data manipulation more efficient by controlling the overall structure or format of our data.

tell-me-more-about-your-data-wrangling-course_small.jpg

Let's read in our dataset, store it in a variable, and remind ourselves about the original structure.

To summarize, we see 52 rows of data, each with 82 columns.
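The import step might look like the following (file path as given in section 0.4.4.1; the result is stored in a variable called data, as used throughout this lecture):

```r
library(tidyverse)

# Read the wide-format OTU table and remind ourselves of its structure
data <- read_csv("data/taxa_pitlatrine_wide.csv")
dim(data)  # 52 rows, 82 columns
str(data)
```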


1.0.0 Introduction to tidy data

Why tidy data?

Data cleaning (or dealing with 'messy' data, aka wrangling) accounts for a huge chunk of a data scientist's time. Ultimately, we want to get our data into a 'tidy' format (long format) where it is easy to manipulate, model and visualize. Having a consistent data structure and tools that work with that data structure can help this process along.

In Tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

This seems pretty straightforward, and it is. The datasets you get, however, will not be straightforward. Having a map of where to take your data is helpful in unraveling its structure and getting it into a usable format.

The 5 most common problems with messy datasets are:

Fortunately there are some tools available to solve these problems.

Observational units: Of the three rules, the idea of observational units might be the hardest to grasp. As an example, you may be tracking a puppy population across 4 variables: age, height, weight, fur colour. Each observational unit is a puppy. However, you might be tracking these puppies across multiple measurements - so a time factor applies. In that case, the observational unit becomes puppy-time, and the puppy-time measurements belong in their own table, separate from the static puppy attributes (at least by tidy data standards). This, however, is a simple example, and things can get more complex when considering what defines an observational unit. Check out this blog post by Claus O. Wilke for a little more explanation.

1.1.0 Introduction to the tidyverse

The tidyverse is the universe of packages created by Hadley Wickham for data analysis. There are packages to help import, tidy, transform, model and visualize data. His packages are pretty popular, so he made a package to load all of his packages at once. This wrapper package is tidyverse. In this lesson series we have used dplyr, readr and readxl, and we will be using dplyr and tidyr today.

tidyverse1.png

Hadley has a large fan-base. Someone even made a plot of Hadley using his own package, ggplot2.

HadleyObama2.png

Back to the normalverse...


1.2.0 Assessing our data frame

Which tidy data rules might our data frame break?

At first glance we can see that the column names are actually 3 different variables: 'Country', 'LatrineNumber', and 'Depth'. This information will likely be useful in our study, as we expect different bacteria at different depths, sites, and geographical locations. Each of these is a variable and should have its own separate column.

We could keep the column names as the sample names (as they are meaningful to the researcher) and add the extra variable columns, or we could make up sample names (ie. Sample_1) knowing that the information is not being lost, but rather stored in a more useful format.

Some of the Taxa also appear to have an additional variable of information (ie. _Gp1), but not all taxa have this information. We could also make a separate column for this information.

Each result is the same observational unit (i.e. relative abundances of bacteria), so having one table is fine.

1.3.0 Introduction to helpful functions in tidyr

tidyr is a package with functions that help us turn our 'messy' data into 'tidy' data. It has 2 major workhorse functions and 2 other tidying functions:

  1. pivot_longer() - convert a data frame from wide to long format
  2. pivot_wider() - convert a data frame from long to wide format
  3. separate() - split a column into 2 or more columns based on a string separator
  4. unite() - merge 2 or more columns into 1 column using a string separator

Note that pivot_longer() and pivot_wider() rely on unique key-value pairs to collapse or expand columns.

We've already loaded tidyverse which includes the tidyr package that the pivot_longer() function is from.


1.3.1 Gather your data from across columns using pivot_longer()

Previously called the gather() function, the updated pivot_longer() function is used to collect our columns in a straightforward way. As the name implies, this will pivot our dataset from a wide to a long format.

pivot_longer(
  data,
  cols,
  names_to = "name",
  names_prefix = NULL,
  names_sep = NULL,
  names_pattern = NULL,
  names_ptypes = list(),
  names_transform = list(),
  names_repair = "check_unique",
  values_to = "value",
  values_drop_na = FALSE,
  values_ptypes = list(),
  values_transform = list(),
  ...
)

We won't be using all of these arguments from pivot_longer() but there are a few we'll highlight here:

  1. data - our data frame (actually a tibble as we mentioned last class but close enough...)
  2. cols - this is the set of columns we wish to pivot. For each observation, all the column names will be stored in a single column, and all of the values in another. As we'll see, this will greatly increase the number of observations in our data. There are many ways to define the parameter for this argument.
  3. names_to - a string or character vector used to name the column(s) where we'll store our column names retrieved from the cols set.
  4. names_sep - if names_to is a character vector, this controls how the column names are parsed (broken up).
  5. values_to - this is the name of the column where we'll store the values retrieved from the cols set.

We've already imported taxa_pitlatrine_wide.csv and stored it as the variable data. Recall it is 52 rows (Taxa) and 82 columns, of which 81 are observations coded as Country_LatrineNumber_Depth.

We'll use pivot_longer() to collect all the 81 observation columns and convert them into observation rows. The column names will be stored in a new variable called Site and the values from these columns will be in a variable called OTUs. Any untransformed columns become additional identifying data for each observation.

For now, we'll just step through the formatting process without saving the result into a variable.
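A first pass, without saving the result, might look like this:

```r
# Pivot all 81 sample columns: their names go into 'Site',
# their values go into 'OTUs'; Taxa is left as an identifier
pivot_longer(data,
             cols      = -Taxa,
             names_to  = "Site",
             values_to = "OTUs")
```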


1.3.1.1 There are many ways to specify cols in pivot_longer()

If we look at the documentation for the pivot_longer() function, we see that the argument cols uses something called <tidy-select>. This is an argument modifier that indicates an additional layer of syntax or language can be used to select variables (columns) based on their names.

This formatting includes some selection helpers like:

The <tidy-select> syntax can be used in other tidyverse functions too, and learning it can help simplify more complex data wrangling steps. Learn more over at the tidyverse page!

For now we'll use some simpler examples to replicate what we did above.


In the above examples, -1 means pivot every column except the 1st; equivalently, -Taxa means pivot every column except "Taxa". Taxa is still retained as a column, but its values are not grouped in with 'Site' as an observation (i.e. we do not want 'Vietnam_9_4', 'Tanzania_2_9', and 'Clostridia' pivoted into the same column).
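Some equivalent ways of specifying cols (column positions assume the 82-column layout described above) could be sketched as:

```r
# All of these pivot the same 81 sample columns:
pivot_longer(data, cols = -1,    names_to = "Site", values_to = "OTUs")
pivot_longer(data, cols = -Taxa, names_to = "Site", values_to = "OTUs")
pivot_longer(data, cols = 2:82,  names_to = "Site", values_to = "OTUs")
pivot_longer(data, cols = !contains("Taxa"),
             names_to = "Site", values_to = "OTUs")
```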

Let's save the last variation into a data frame called long_data.

1.3.1.2 What have we done using pivot_longer()?

Note how the dimensions of your dataframe have changed relative to data. Instead of 52 rows and 82 columns, we now have a data frame with 4,212 rows and 3 columns (which is the 81 columns we pivoted x 52 rows). long_data is now in a long format instead of wide.


1.3.2 separate() can split variables into multiple columns by specifying a text delimiter

Notice the information contained in our "Site" data? It is a combination of 3 values: country, latrine number, and depth information all separated with "_" between each.

We can use the separate() function to retrieve the Country, LatrineNumber, and Depth information from our Site variable. The separate() function takes in your data frame, the name of the column to be split, the names of your new columns, and the character that you want to split the columns by (in this case an underscore, `_`). Note that the default is to remove your original column - you can keep it by adding the argument `remove = FALSE`, keeping in mind that you now have redundant data.

separate(
  data,
  col,
  into,
  sep = "[^[:alnum:]]+",
  remove = TRUE,
  convert = FALSE,
  extra = "warn",
  fill = "warn",
  ...
)

We need to provide separate() with information to help split our variable.

  1. data - our data frame or tibble
  2. col - the name or position of the column we want to split.
  3. into - a character vector of the column names we want to split from argument col.
  4. sep - tells separate() how to break up the information in each row of col.

1.3.2.1 Remember to convert your variable types with the convert parameter

Looking at our output from above, there's one little thing we missed that you'll have to consider when using functions like separate() yourself - do you want to automatically convert your split columns to the correct data type? The default behaviour of separate() is to leave your new columns as the character data type. You may, however, want it to try to convert them in a less agnostic fashion, but this could also introduce NA values, so beware! If you are splitting many variables, it may be easier to convert automatically and fix any inadvertent variable types afterwards.

For instance, we may want the variable Depth as a numeric for statistical analysis, whereas Latrine_Number may be more appropriate as a label. Let's try the convert logical parameter.

Let's fix that Latrine_Number variable while we are at it! We'll save the final form too


1.3.2.2 Many paths can lead to one destination: separate() versus pivot_longer()

We just worked through an excellent example of how to use the separate() function but in the case of splitting the Site variable into three new variables, we could have taken a different approach. Remember back in section 1.3.1 we mentioned the pivot_longer() argument names_sep. In this situation, the Site variable is derived from the column titles we used in pivot_longer().

With careful consideration, we could combine the above step into the pivot_longer() call, which can do an internal separate() under the hood. This is obviously a common enough occurrence in the data science community that the feature was added to pivot_longer().

To use this feature, we just need to specify the new column information correctly in the names_to parameter along with our names_sep parameter.
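The combined step could be sketched as:

```r
# pivot_longer() can split the column names while pivoting,
# replacing a separate() call afterwards
pivot_longer(data,
             cols      = -Taxa,
             names_to  = c("Country", "Latrine_Number", "Depth"),
             names_sep = "_",
             values_to = "OTUs")
```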


See? Two sets of code, but we get exactly the same result!

two_code_spiderman.jpg


Now getting back to our data, we may also want to split the Taxa column off based on the '_Gp' (Group) of Acidobacteria and other entries. We cannot do this step using pivot_longer() since this is an original variable untouched by the pivot_longer() process.

Let's try the code with separate() but do not save the answer in a variable.

Notice that we have triggered a warning from R: it has filled in 'NA' for the bacteria that did not have groups. Also note that we chose to split Taxa using '_Gp' since we did not need the 'Gp' text itself.
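The split described above (shown here on a long-format data frame called long_data) might be written as:

```r
# Split e.g. "Acidobacteria_Gp1" into Taxa = "Acidobacteria" and
# Group = "1"; sep = "_Gp" consumes the "Gp" text, and taxa without
# a group are filled with NA (triggering a warning)
separate(long_data,
         col  = Taxa,
         into = c("Taxa", "Group"),
         sep  = "_Gp")
```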


1.3.3 Challenge: compare glimpse() and str() information for our split data

Use the glimpse() function to look at the type of each variable in our new data frame, split_gathered_data. Are those the types you expected? Why or why not? How is glimpse() different from the str() function?



1.3.4 Recall that we can group_by() to organize our dataset

Last lecture we learned a useful function, group_by(), that groups data based on one or more variables. This is useful for calculations and plotting on subsets of your data without having to turn your variables into factors.

Suppose we wanted to look at a combination of Country and Well Depth. While visually, you wouldn't notice any changes to your data frame, if you look at the structure it will now be a grouped_df. There are 15 groupings resulting from Country and Depth. After we have performed our desired operation, we can return our data frame to its original structure by calling ungroup().

First we will examine the structure of grouped and ungrouped output without any additional operations.

Note the additional attribute called groups in our grouped data? What do you think the values stored here represent?
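Inspecting the grouped and ungrouped structures (using the split data frame, called split_gathered_data in the challenge below) might look like:

```r
# Group by Country and Depth, then inspect the structure
grouped_data <- group_by(split_gathered_data, Country, Depth)
str(grouped_data)    # now a grouped_df with a 'groups' attribute

# Remove the grouping again
ungrouped_data <- ungroup(grouped_data)
str(ungrouped_data)  # back to a plain tibble
```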


1.3.5 We can combine group_by() and summarise() functions to produce sensible stat data

Recall last lecture we briefly used the group_by() and summarise() functions in combination to produce some preliminary statistics on our data. With this much larger set of data, this is a perfect time to revisit that combination of functions.

In our initial messy data frame it was difficult to do calculations based on Country, Latrine Number, or Well Depth; this is now an easy task. Let's get the mean, median, standard deviation, and maximum value for the number of OTUs collected in Tanzania vs Vietnam.
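One way to sketch those summary statistics (assuming the tidied data frame is called split_gathered_data, as in the earlier challenge):

```r
split_gathered_data %>%
  group_by(Country) %>%
  summarise(mean_OTUs   = mean(OTUs),
            median_OTUs = median(OTUs),
            sd_OTUs     = sd(OTUs),
            max_OTUs    = max(OTUs))
```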


In dealing with grouped data, we no longer have to grab a Country by subsetting or using helper functions to grab letters from their names. group_by() produces that modified data frame which recognizes that we have 2 countries and will perform calculations for both of them.

2.0.0 Answering questions about our data

Now that we have tidy data, let's proceed to answering our questions:

2.1.0 Question: Which latrine has the greatest mean number of OTUs?

What steps do we need to take to answer this question? Before we dive in, let's consider what we need to answer this question.

  1. We need to group data from the same latrine together.
  2. We need the mean OTU value from each latrine group.
  3. We need to sort or identify the latrine group with the highest mean OTU value.
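The three steps above might translate to the following (grouping by Country as well, since latrine numbers repeat between countries; the data frame name is assumed from the earlier challenge):

```r
split_gathered_data %>%
  group_by(Country, Latrine_Number) %>%
  summarise(mean_OTUs = mean(OTUs)) %>%
  arrange(desc(mean_OTUs))  # the top row answers our question
```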

Recall our original data set had three variables that were mashed together into a Site (Country_LatrineNo_Depth) and answering our question would have been difficult. With tidy data format we have split a combined variable column into three, making Latrine_Number a variable. This allowed us to simply group our data to perform our mean calculation and retrieve an answer.


2.2.0 Question: Is there more Clostridia in Tanzania or Vietnam?

This is a bit of variation on our first question. What kind of data do we need?

  1. A filtered data set where we only look at Clostridia entries.
  2. We need to group based on country.
  3. A sorted summary of the OTU data.

Again, being able to filter by Taxa and group by Country (as an isolated variable) helps a lot. With dplyr syntax we can perform all data manipulations and calculations in a code block that is readable.
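One possible sketch of that code block (data frame name assumed from the earlier challenge):

```r
split_gathered_data %>%
  filter(Taxa == "Clostridia") %>%       # Clostridia entries only
  group_by(Country, Taxa) %>%            # one group per country
  summarise(total_OTUs = sum(OTUs)) %>%  # total OTUs per group
  arrange(desc(total_OTUs))
```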

Note that we grouped by both Country and Taxa. This was done more for visual purposes to confirm that Clostridia was the only taxa coming out of our filter. We could easily have done group_by(Country) with the same result. Our final output, however, would not include the Taxa column.


2.3.0 Question: Which site had the greatest number of Taxa represented?

Let's break down the question again into its basic components!

  1. A sampling 'site' is unique to a Country, Latrine Number, and Depth so we will need to sort/group using these aspects.
  2. Exclude any taxa that weren't observed at our site.
  3. Use the n() command from dplyr to count the rows in each group.
  4. Sort the data to find the most abundant site.

Since we can group by the 3 variables that were in the Site name, there is no disadvantage to having our data in tidy format compared to our original wide data frame. However, now we are able to filter for non-zero OTUs, which was impossible in the wide format. Since we know from earlier in the lesson that each Taxa is only represented once per site, we only have to count and order the number of observations to get our answer.
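The steps above might be sketched as (data frame name assumed from the earlier challenge):

```r
split_gathered_data %>%
  filter(OTUs > 0) %>%                           # drop unobserved taxa
  group_by(Country, Latrine_Number, Depth) %>%   # one group per site
  summarise(taxa_count = n()) %>%                # count taxa per site
  arrange(desc(taxa_count))                      # most taxa first
```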


3.0.0 Getting back to the way we were

To get data back into its original format, there are reciprocal functions in the tidyr package, making it possible to switch between wide and long formats.

Fair question: But you've just been telling me how great the 'long' format is?!?! Why would I want the wide format again???

Honest answer: Note that our original data frame was 52 rows and expanded to 4212 rows in the long format. When you have, say, a genomics dataset you might end up with 6,000 rows expanding to 600,000 rows. You probably want to do your calculations and switch back to the more "human writeable/readable" format. Sure, I can save a data frame with 600,000 rows, but I can't really send it to anyone because spreadsheet software such as Excel might crash while trying to open the file.

3.1.0 unite() your columns back again

unite(
  data, 
  col, 
  ..., 
  sep = "_", 
  remove = TRUE, 
  na.rm = FALSE
)

As the opposite of separate(), unite() needs information to help consolidate our columns.

  1. The first argument, data, is our data frame (aka tibble).
  2. The second argument, col, is the name of the column where we want to keep our combined data.
  3. The next argument, ..., is a list of the column names we want to join into col.
  4. The fourth argument, sep, tells unite() what kind of character to put between the recombined data values going into col.

Let's turn back time and collapse Country, Latrine_Number and Depth back into one variable, 'Site', using the unite() function. Store the output in a data frame called united_data.
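The call described above could look like this (assuming the split data frame from earlier in the lecture):

```r
# Recombine the three variables into 'Site', separated by underscores
united_data <- unite(split_gathered_data,
                     col = "Site",
                     Country, Latrine_Number, Depth,
                     sep = "_")
```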


3.2.0 Use pivot_wider() to convert your data from long to wide format

pivot_wider(
  data,
  id_cols = NULL,
  names_from = name,
  names_prefix = "",
  names_sep = "_",
  names_glue = NULL,
  names_sort = FALSE,
  names_repair = "check_unique",
  values_from = value,
  values_fill = NULL,
  values_fn = NULL,
  ...
)

As the opposite of pivot_longer(), pivot_wider() needs information to help spread our data back out. This can be tricky to conceptualize, BUT the goal is to consolidate row entries based on the specific columns that we do NOT name.

  1. data - our data frame (aka tibble) that we wish to convert.
  2. id_cols - the unique column(s) that will form the basis of identifying each observation. The default identifiers are any unselected columns from names_from and values_from.
  3. names_from - a column or multiple columns from which to pivot back to a wider format. These will become variable names whose values will come from...
  4. values_from - pairs with the names_from parameter to fill the new variable columns formed by names_from.

Let's turn the hands of time back further and use the pivot_wider() function to turn united_data into the wide shape of our original dataset. Save the output into a data frame called wide_data.
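A sketch of that final pivot:

```r
# Spread 'Site' back out into one column per sample, filled with OTUs
wide_data <- pivot_wider(united_data,
                         names_from  = Site,
                         values_from = OTUs)
dim(wide_data)  # back to 52 rows and 82 columns
```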


3.3.0 Save our data frame to a text file

We'll use the standard write_csv command to save our results to our data folder.


4.0.0 Example challenges for working with tidy data

We've gone full circle from pivoting a wide format dataset to long format, splitting multivariable columns and summarizing on the observations, and reverting the whole thing back to its original form. Now it's time to practice a bit more.

Brace_for_examples.jpg

4.1.0 Filter and subset united_data

  1. Filter united_data to remove all rows with zero OTU values, and store the result as united_data_no_zeros.
  2. Sort united_data_no_zeros to keep the top 20 rows with the highest OTUs.

4.1.1 How many rows in united_data had a value of zero?

Let's alter our previous code to answer the question of just how many observations in united_data had a value of 0 in the OTUs variable.
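One way to count the zero-valued rows (variable names follow the exercise above):

```r
united_data %>%
  filter(OTUs == 0) %>%
  nrow()
```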


4.2.0 Convert microbes_wide.csv to long format

Reshape microbes_wide.csv into a long format version called microbes_long. Make sure the columns are in the following order: "ASV", "salinity", "compound", "group", "replicate", "kingdom", "phylum", "class", "order", "family", "genus", and "abundance".

Once you're done, write microbes_long as microbes_long.csv.

Notes about microbes_wide.csv:

  1. Each observation name is a combination of the salinity, compound, group, and replicate
  2. Each taxa is a combination of kingdom, phylum, class, order, family, and genus
  3. Measured values for each observation are overall abundance of the specific microbe.

Examining our data, the first two columns hold sequence and taxa information, and the remaining 28 hold our multivariable data. Let's begin by pivoting to long format before breaking up the multivariable columns of sample_name and taxa.


Our columns are in the wrong order! We need to shuffle "salinity", "compound", "group", and "replicate" towards the left. Relocate columns to match this order: "ASV", "salinity", "compound", "group", "replicate", "kingdom", "phylum", "class", "order", "family", "genus", and "abundance".


Is there a dplyr function we could use instead? Take some time to think about it!

Now just write the data frame to file as microbes_long.csv.


4.3.0 Convert gapminder_wide to a long format

Read in the gapminder_wide.csv. What rules of tidy data does it break? Transform the dataset to the format below.

continent  country      year  lifeExp  pop       gdpPercap
Asia       Afghanistan  1952  28.801   8425333   779.4453
Asia       Afghanistan  1957  30.332   9240934   820.8530
Asia       Afghanistan  1962  31.997   10267083  853.1007
Asia       Afghanistan  1967  34.020   11537966  836.1971
Asia       Afghanistan  1972  36.088   13079460  739.9811
Asia       Afghanistan  1977  38.438   14880372  786.1134
  1. How many rows do you have?
  2. Save the newly reshaped document as gapminder_long

Reshape gapminder_wide into a long, cleansed format. BEFORE you start writing any code, IDENTIFY the formatting deficiencies and then PLAN ahead what you want to achieve.


4.3.1 Issues with gapminder_wide

Observations appear to be split into visible categories based on continent and country (variables 1 and 2); the remaining columns are a combination of:

  1. gdpPercap_year - the GDP per capita for a specific year in that continent/country combination
  2. lifeExp_year - the life expectancy for a specific year in that continent/country combination
  3. pop_year - the population for a specific year in that continent/country combination

So we are seeing multiple observations per column AND multiple observation types in the table (GDP, life expectancy, and population)! What a mess!!


4.3.2 Break up a dual-information variable

We've generated a long format table but we still have obs_type as a variable which holds two kinds of information:

  1. The observation type (gdpPercap, lifeExp, and pop).
  2. The year of that observation.

We need to break that up into separate variables.


4.3.3 Use pivot_wider() to widen your table based on a mixed category observation

We've successfully separated our dual-variable into obs_type and year. However, the three observation types are still trapped in the same table and column! Let's fix this so that each observation type (and its value) becomes its own variable. We'll have to take a small step back before we can take a step forward. Send it back into a slightly wider format!


4.3.4 Use select() to rearrange your columns!

We now have 1/3 as many observations (rows) in our data frame because that data has been pivoted into 3 new columns! Just what we wanted. Nearly there! Let's review what we want as a final data table.

continent country year lifeExp pop gdpPercap
Asia Afghanistan 1952 28.801 8425333 779.4453
Asia Afghanistan 1957 30.332 9240934 820.8530
Asia Afghanistan 1962 31.997 10267083 853.1007
Asia Afghanistan 1967 34.020 11537966 836.1971
Asia Afghanistan 1972 36.088 13079460 739.9811
Asia Afghanistan 1977 38.438 14880372 786.1134

It looks like our data are currently sorted by country name rather than continent, so let's put the rows and columns into order. Then we'll save the result to a file.
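The reordering and saving steps above might look something like this sketch, assuming the reshaped gapminder_long data frame; the output filename is our own choice:

```r
# Sort by continent, then country, then year, and put the columns
# in the order shown in the target table.
gapminder_long <- gapminder_long %>%
  arrange(continent, country, year) %>%
  select(continent, country, year, lifeExp, pop, gdpPercap)

# Save the tidied data to disk.
write_csv(gapminder_long, "gapminder_long.csv")
```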


4.4.0 Make a basic exploratory plot with ggplot

Exploratory plots allow us to quickly answer questions about our data or to get an idea of trends and data distribution. This week the DataCamp assignment will have you making a few basic plots with the data that you format. We'll be discussing plots using the ggplot package in much more detail next lecture. For now, here is an overview for producing a basic line or scatterplot using ggplot.

4.4.1 Every plot needs data, aesthetics, and a geom_*()

To produce a ggplot object, you need to supply a minimum of three components:

  1. ggplot(): this initializes a ggplot object, and requires a dataset of some kind. You set it with the argument data.
  2. aes(): this sets the details of how your data will be used. For example the x and y axes are set here, if applicable. You can also define how data are grouped with arguments like colour, fill, and size, setting these using variable names from data.
  3. geom_*(): the geometric object defines the type of plot you want to visualize. The simplest versions are geom_line() (line graphs), geom_bar() (bar charts), and geom_point() (scatterplots, dot plots, and bubble charts).

Let's filter some gapminder_long data to plot as a line graph and scatter plot. From the results we may be able to glean some insights!
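As one hypothetical example of that filter-then-plot workflow, the line graph could be sketched like this, assuming the tidied gapminder_long data frame from earlier:

```r
# Life expectancy over time for Asian countries, one line per country.
gapminder_long %>%
  filter(continent == "Asia") %>%
  ggplot(aes(x = year, y = lifeExp, colour = country)) +
  geom_line()
```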


It looks like life expectancy is generally on the increase in countries throughout Asia. What about life expectancy across the world based on factors like GDP per capita (i.e., how rich a country is)?
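A sketch of that world-wide comparison as a scatterplot, again assuming the tidied gapminder_long data frame; the log-scaled x axis is our own choice to spread out the GDP values:

```r
# Life expectancy vs GDP per capita for the 2007 snapshot,
# coloured by continent.
gapminder_long %>%
  filter(year == 2007) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point() +
  scale_x_log10()
```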


With just a few lines of code, we can see that the overall trend suggests that life expectancy is positively correlated with GDP. That isn't always the case, judging from some African countries, but it is the general trend we see from a 2007 snapshot of the data. Next week we'll learn even more about using ggplot to visualize our data and customize its presentation.


5.0.0 Class summary

That's the end for our third class on R! You've made it through and we've learned about the following:

  1. The tidy data format and philosophy.
  2. Converting wide format data to long format.
  3. Splitting multivariable columns.
  4. Grouping and summarising data.
  5. Pivoting a multiple observation column to single observation variables.
  6. Returning data to a wide format.
  7. A small taste of visualizing data with ggplot.

5.1.0 Post-lecture assessment (12% of final grade)

Soon after the end of this lecture, a homework assignment will be available for you in DataCamp. Your assignment is to complete two chapters from the Reshaping Data with tidyr course: Tidy data (950 points) and From wide to long and back (1300 points). This is a pass-fail assignment, and in order to pass you need to achieve at least 1,687 points (75%) of the total possible points. Note that when you take hints from the DataCamp chapter, it will reduce your total earned points for that chapter.

In order to properly assess your progress on DataCamp, at the end of each chapter, please take a screenshot of the entire course summary. You'll find this under the "Course Outline" menu bar at the top of the page for each course; you'll want to expand each section. It should look something like this:

DataCampIntroR.jpg

You may need to take several screenshots if you cannot capture it all in a single image. Submit the file(s) or a combined PDF for the homework to the assignment section of Quercus. By submitting your scores for each section and chapter, we can keep track of your progress, identify knowledge gaps, and produce a standardized way for you to check on your assignment "grades" throughout the course.

You will have until 13:59 hours on Thursday, October 7th to submit your assignment (right before the next lecture).


5.2.0 Acknowledgements

Revision 1.0.0: materials prepared in R Markdown by Oscar Montoya, M.Sc. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.1.0: edited and prepared in Jupyter Notebook by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.


5.3.0 Your DataCamp academic subscription

This class is supported by DataCamp, the most intuitive learning platform for data science and analytics. Learn any time, anywhere and become an expert in R, Python, SQL, and more. DataCamp’s learn-by-doing methodology combines short expert videos and hands-on-the-keyboard exercises to help learners retain knowledge. DataCamp offers 350+ courses by expert instructors on topics such as importing data, data visualization, and machine learning. They’re constantly expanding their curriculum to keep up with the latest technology trends and to provide the best learning experience for all skill levels. Join over 6 million learners around the world and close your skills gap.

Your DataCamp academic subscription grants you free access to DataCamp's catalog for 6 months from the beginning of this course. You are free to look for additional tutorials and courses to help grow your skills for your data science journey. Learn more (literally!) at DataCamp.com.

DataCampLogo.png


5.4.0 Resources

https://github.com/wmhall/tidyr_lesson/blob/master/tidyr_lesson.md
http://vita.had.co.nz/papers/tidy-data.pdf
https://thinkr.fr/tidyverse-hadleyverse/
http://stat545.com/bit001_dplyr-cheatsheet.html
http://dplyr.tidyverse.org/articles/two-table.html

CAGEF_new.png